Distinguishing Humans from Bots in Web Search Logs

Authors

  • Omer M. Duskin
  • Dror G. Feitelson
Abstract

Cleaning workload data and separating it into classes is a prerequisite for workload characterization. In particular, the workload on web search engines is derived from the activities of both human users and automated bots. It is important to distinguish between these two classes in order to reliably characterize human web search behavior, and to study the effects of bot activity. However, available workload data is not accompanied by labels that can be used as a basis for learning and generalization. To cope with the lack of labeled data, we suggest using two mechanisms. The first is to employ two thresholds for each criterion, enabling the identification of users who are most probably human or most probably bots according to need, and avoiding ambivalent cases. The second is the notion of “strong” criteria, which identify levels of activity that are highly unlikely or even impossible for humans to achieve. We then use an iterative process of refining the thresholds to combine the results of multiple metrics in a mutually consistent manner. Results using the AOL log identify over 92% of the users as human, and only a small fraction (0.6%) are probable bots. The humans tend to display relatively consistent behavior, whereas bots may exhibit markedly different behaviors. In particular, it is not uncommon for a bot to be very different from typical human behavior according to one criterion, while being indistinguishable from a human according to another.
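
As a rough illustration of the two-threshold idea and the “strong” criteria described in the abstract, the following Python sketch may help. It is not the authors' implementation: the metric names (queries_per_day, queries_per_minute), every numeric threshold, and the rule for combining criteria are assumptions made up for this example, and the iterative refinement of thresholds is not shown.

```python
from enum import Enum


class Label(Enum):
    HUMAN = "human"
    BOT = "bot"
    UNKNOWN = "unknown"


# Hypothetical (human_max, bot_min) pairs per criterion; not values from the paper.
THRESHOLDS = {
    "queries_per_day": (50, 500),
    "queries_per_minute": (5, 30),
}

# Hypothetical "strong" limits: activity levels assumed to be impossible for a human.
STRONG_LIMITS = {"queries_per_day": 10_000}


def classify_criterion(value, human_max, bot_min):
    """Apply two thresholds to one criterion (larger values are more bot-like)."""
    if value <= human_max:
        return Label.HUMAN
    if value >= bot_min:
        return Label.BOT
    return Label.UNKNOWN  # ambivalent zone between the two thresholds


def classify_user(metrics, thresholds=THRESHOLDS, strong_limits=STRONG_LIMITS):
    """Combine criteria; the combination rule here is an assumption of this sketch."""
    # A single violated strong criterion is enough to label the user a bot.
    for name, limit in strong_limits.items():
        if metrics[name] > limit:
            return Label.BOT
    labels = {
        classify_criterion(metrics[name], human_max, bot_min)
        for name, (human_max, bot_min) in thresholds.items()
    }
    # Only unanimous verdicts are accepted; mixed or ambivalent cases stay unlabeled.
    if labels == {Label.HUMAN}:
        return Label.HUMAN
    if labels == {Label.BOT}:
        return Label.BOT
    return Label.UNKNOWN


if __name__ == "__main__":
    print(classify_user({"queries_per_day": 12, "queries_per_minute": 2}))
```

Requiring unanimity among the ordinary criteria is a deliberately conservative choice for the sketch: a user who looks like a bot on one criterion and like a human on another is left unlabeled rather than forced into a class.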

Similar resources

Image flip CAPTCHA

The massive and automated access to Web resources through robots has made it essential for Web service providers to make some conclusion about whether the "user" is a human or a robot. A Human Interaction Proof (HIP) like Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) offers a way to make such a distinction. CAPTCHA is a reverse Turing test used by Web serv...

Bot or Not? A Case Study on Bot Recognition from Web Session Logs

This work reports on a study of web usage logs to verify whether it is possible to achieve good recognition rates in the task of distinguishing between human users and automated bots using computational intelligence techniques. Two problem statements are given, offline (for completed sessions) and on-line (for sequences of individual HTTP requests). The former is solved with several standard co...

Distinguishing Humans from Robots in Web Search Logs

The workload on web search engines is actually multiclass, being derived from the activities of both human users and automated robots. It is important to distinguish between these two classes in order to reliably characterize human web search behavior, and to study the effect of robot activity. We suggest an approach based on a multidimensional characterization of search sessions, and take firs...

Bots are Users, Too! Rethinking the Roles of Software Agents in HCI

Increasingly sophisticated autonomous software agents called ’bots’ roam throughout the Internet, performing a wide variety of tasks, some for good and some for evil. Yet while autonomous, these bots are not artificial intelligences, instead programmed to perform mundane, routine tasks that would otherwise be impossible by humans. Useful bots crawl the web for search engines, enforce order in I...

Do Bots impact Twitter activity?

The WWW has seen massive growth in the population of automated programs (bots) for a variety of exploits on online social networks (OSNs). In this paper we extend our previous work to study the effects of bots on Twitter. By setting up a bot account on Twitter and conducting analysis on a click logs dataset from our web server, we show that despite bots being in smaller numbers, they exercise a ...

Publication date: 2010